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Abstract 

In this paper we present a novel sampling-based numerical scheme designed to solve a certain 
class of stochastic optimal control problems, utilizing forward and backward stochastic differ¬ 
ential equations (FBSDEs). By means of a nonlinear version of the Feynman-Kac lemma, we 
obtain a probabilistic representation of the solution to the nonlinear Hamilton-Jacobi-Bellman 
equation, expressed in the form of a decoupled system of FBSDEs. This system of EBSDEs can 
then be simulated by employing linear regression techniques. To enhance the efficiency of the 
proposed scheme when treating more complex nonlinear systems, we then derive an iterative 
modification based on Girsanov’s theorem on the change of measure, which features importance 
sampling. The modified scheme is capable of learning the optimal control without requiring an 
initial guess. We present simulations that validate the algorithm and demonstrate its efficiency 
in treating nonlinear dynamics. 


1 Introduction 

The problem of obtaining an optimal control in a stochastic setting is typically associated with 
the solution of a generally nonlinear second-order partial differential equation (PDE), known as 
the Hamilton-Jacobi-Bellman (HJB) equation. Different methods can be distinguished based on 
whether they seek a solution over the entire domain, or locally around a nominal system trajectory. 
In the first case, several attempts have been made to address the inherent difficulty in solving such 
nonlinear PDEs, as well as the curse of dimensionality [1-5]. The latter case, on the other hand, 
includes methods such as Stochastic Differential Dynamic Programming [6, 7], which is based 
on linearization of the dynamics and a quadratic approximation of the cost functions around 
nominal trajectories, as well as sampling-based methods, which are distinguished for their ability to 
accommodate scalable iterative schemes. Various sampling-based methods appear in the literature 
under the names Path Integral control (PI) [8,9], Kullback Leibler (KL) control, or Linearly 
Solvable Control [10,11]. 

The fundamental characteristic of all aforementioned sampling-based methods is that they 
rely on the exponential transformation of the Value function [12], and the linear version of the 
Eeynman-Kac lemma [13]. Under the exponential transformation, and by introducing certain 
restrictions between control authority and stochasticity, there exists a direct relationship between 
the Hamilton-Jacobi-Bellman PDE and the backward Chapman-Kolmogorov PDE. The latter, 
being a linear PDE, enables the use of the linear Eeynman-Kac formula, which relates certain linear 
backward PDEs to forward stochastic differential equations (SDEs). Thus, the corresponding 
optimal control problem can be solved using forward sampling. While forward sampling-based 
methods exhibit several advantages against traditional methods of stochastic control, such as 


the mild conditions on the differentiability of the cost and the stochastic dynamics, there are 
also some key disadvantages which pertain to the nature of the exponential transformation. In 
particular, the effect of the exponential transformation can be identified as the mapping of the 
value function v{t,x), which has range [0,oo), to the desirability function whose range 

is (0,1]. This mapping leads to a drastic reduction in the ability to distinguish states with high 
cost (low desirability) from states with low cost (high desirability). This issue has been partially 
addressed with renormalization of the trajectory cost [7]. Finally, while the necessary constraint 
introduced between control authority and stochasticity can lead to symmetry breaking phenomena 
and delayed decision [14] , it is a rather restrictive assumption whenever applications to engineered 
systems are considered. 

In this paper we present a learning control algorithm which capitalizes on the innate rela¬ 
tionship between certain nonlinear PDFs and Forward and Backward SDEs, demonstrated by a 
nonlinear version of the Feynman-Kac lemma. By means of this lemma, we obtain a probabilistic 
representation of the solution to the nonlinear HJB PDF, expressed in the form of a system of 
decoupled FBSDFs. This system of FBSDFs can then be simulatedby employing linear regression 
techniques. We wish to highlight the difference between the proposed approach and already ex¬ 
isting sampling-based formulations: our approach addresses directly the nonlinear PDF, while the 
latter make use of the exponential transformation, which under certain conditions yields a linear 
PDE problem, and then use forward sampling to address that linear problem. Thus, the herein 
proposed framework relaxes these restrictive conditions. To enhance the efficiency of the proposed 
scheme when treating more complex nonlinear systems, we then derive an iterative algorithm 
based on Girsanov’s theorem on the change of measure, which features importance sampling. The 
proposed scheme is capable of learning the optimal control without requiring an initial guess. We 
present simulations that validate the algorithm and demonstrate its efficiency in treating nonlinear 
dynamics. 


2 Problem Statement 


Let (D, {^i}t> 0 )IP’) be a complete, filtered probability space on which is defined ap-dimensional 

standard Brownian motion Wt, such that {^t}t>o is the natural filtration of Wt augmented by all 
P-null sets. Consider the problem of minimizing the expected cost defined by the cost functional 


J(r, Xr\ u{-)) = E 


g{x{T)) + 


q{t, x{t)) + -u {t)Ru{t)dt 


( 1 ) 


associated with the stochastic controlled system which is represented by the Ito stochastic differ¬ 
ential equation (SDE) 


dx{t) = /(t, x{t))dt + G{t, x{t))u{t)dt + S(t, x{t))dWt, t G [r, T], 
x{t) = Xr, 


where T > r > 0, T is a fixed time of termination, x G M” is the state vector, u G is the 

control vector, R \s a v x u positive definite matrix, and g : M"" —>■ M, g : [0, T] x —>■ M, 

/ : [0,T] X —)■ M”, G : [0, T] x M" —)■ and S : [0, T] x M" —)■ are deterministic 

functions, that is, they do not depend explicitly on w G D. We assume that all standard technical 
conditions [15] which pertain to the filtered probability space and the regularity of functions 
are met, in order to guarantee existence, uniqueness of solutions to (2), and a well defined cost 
functional (1). These impose for example that the functions g, q, f, G and S are continuous w.r.t. 
time t (in case there is explicit dependence), Lipschitz (uniformly in t) with respect to the state 




variables and satisfy standard growth conditions over the domain of interest. Furthermore, the 
square-integrable process u ■. [0,r] x —)• [/ C is {^t}i>o-adapted (also called progressively 
measurable), which essentially translates into the control input being non-anticipating, i.e., relying 
only on past and present information. We denote the set of all admissible [/-valued functions 
as ^[0,r]. For any given initial condition (r, x,-), we wish to minimize (1) under all admissible 
functions u{-) G '^[0,r]. We define the Value function V as 

{ V(T,Xr) = inf (T,Xr) G [0, T) X M” 

ui-)e'W[o,T] ^ 3 ^ 

V{T,x)=g{x), xGM”. 


By applying the stochastic version of Bellman’s principle of optimality, it is shown [15,16] that 
if the Value function is in x M”), then it is a solution to the following terminal value 

problem of a second-order partial differential equation, known as the Hamilton-Jacobi-Bellman 
(HJB) equation, which, for the problem at hand, and suppressing function arguments for notational 
compactness, takes the form 


Vt + inf I ^tr(ua,a;VE'^) +vj{f + Gu) +q + 1 = 0, {t, x) G [0, T) x M”, 

u&U z z J 

v{T,x) = g{x), X G M”. 


(4) 


where Vx and Vxx denote the gradient and the Hessian of v, respectively. The term inside the 
brackets is the Hamiltonian, and is denoted by H. Note that this result can be extended to 
include cases where the Value function does not satisfy the smoothness condition. Then, if one 
also considers viscosity solutions of (4), the Value function is proven to be a viscosity solution of 
(4). Furthermore, the viscosity solution is equal to the classical solution, if a classical solution 
exists. For the chosen form of the cost integrand, and assuming that the optimal control lies in the 
interior of U, we may carry out the inhmum operation by taking the gradient of the Hamiltonian 
with respect to u and setting it equal to zero to obtain 

BH 

-^=0 or — Ru — G~^ {t,x)vx{t,x) = 0. (5) 

ou 

Therefore, the optimal control is given by 

u*{t,x) = —R~^G~^{t,x)vx{t,x), {t,x) G [0,r] X M”. (6) 


Inserting the above expression back into the original HJB equation and suppressing again function 
arguments, we obtain the equivalent characterization 


Vt + ^tr(xxxBS’^) +v1f + q- GR ^G~^Vx = 0, {t, x) G [0, T) x M”, 

v{T,x)=g{x), X G M"'. 


3 A Feynman-Kac Type Representation Through FBSDEs 

There is an innate relation between stochastic differential equations and second-order partial dif¬ 
ferential equations of parabolic or elliptic type. Specifically, solutions to a certain class of nonlinear 
PDFs can be represented by solutions to forward-backward stochastic differential equations (FBS¬ 
DEs), in the same spirit as demonstrated by the well-known Feynman-Kac formulas [13] for linear 
PDFs. We begin by briefly reviewing FBSDEs. 


3.1 The Forward and Backward Process 


As a forward process we shall define the square-integrable, {.^s}s>o-adapted process which, 

for any given initial condition {t,x) G [0,T] x M"", satisfies the Ito FSDE 

f dA, = 6(s,X,)ds + S(s,X,)dfF„ s G [t,r], 

\ Xt = x. ^ ^ 


The forward process (8) is also called the state process in the literature. We shall denote the 
solution to the forward SDE (8) as Xl’^, wherein {t,x) are the initial condition parameters. 

In contrast to the forward process, the associated backward process is the square-integrable, 
{=^s}s>o-adapted pair (E(-), Z{-)) defined via a BSDE satisfying a terminal condition 

/ dn = -h{s,X,,Ys,Zs)ds + ZjdWs s G [t,T], 

Yt = a{XT). 


The function h{-) is called generator or driver. The solution is implicitly defined by the initial 
condition parameters {t,x) of the FSDE since it obeys the terminal condition g{x!^^). We will 
similarly use the notation Yg’^ and Zt’^ to denote the solution for a particular initial condition 
parameter {t, x) of the associated ESDE. 

While FSDEs have a fairly straightforward definition, in the sense that both the SDE and 
the filtration evolve forward in time, this is not the case for BSDEs. Indeed, since solutions to 
BSDEs need to satisfy a terminal condition, integration needs to be performed backwards in time 
in some sense, yet the hltration still evolves forward in time. It turns out [17] that a terminal value 
problem involving BSDEs admits an adapted (i.e., non-anticipating) solution if we back-propagate 
the conditional expectation of the process, that is, if we set Yg = E[ys|#s]. 

Notice that the FSDE does not depend on Yg or Zg. Thus, the resulting system of EBSDEs 
is said to be decoupled. If, in addition, the functions b, Ti, h and g are deterministic, in the 
sense that they do not depend explicitly on a; G D, then the adapted solution {Y, Z) exhibits the 
Markovian property; namely, it can be written as deterministic functions of solely time and the 
state process [18]: 


Theorem 1 (The Markovian Property) - There exist deterministic functions v{t,x) and d{t,x) 
such that the solution (Y^’^,Z^’^) of the BSDE (9) is 


Ei’" = vis,xln, zl>^ = j:~^is,xlndis,xln, 


( 10 ) 


for all s G [t, T]. 


3.2 The Nonlinear Feynman-Kac Lemma 

We now proceed to state the nonlinear Feynman-Kac type formula, which links the solution of a 
class of PDFs to that of FBSDEs. Indeed, the following theorem can be proven [15,17,18] by an 
application of Ito’s formula: 

Theorem 2 (Nonlinear Feynman-Kac) - Consider the Cauchy problem 

{vt + ^ir{vxxY‘Y^)-\-v^b{t,x) + h{t,x,v,Y:)^Vx) = d, (t, x) G [0, T) x M", 

I v{T,x) = g{x), X G 

^While X is a function of s and a;, we shall use Xs for notational brevity. 

^By abuse of notation, here (t, x) are symbolic arguments of the functions v and d, and not the initial condition 
parameters as in , Z^’^). Throughout this work, it should be clear from the context whether {t,x) are to be 
understood as initial condition parameters or symbolic arguments. 



wherein the functions S, b, h and g satisfy mild regularity conditions. Then (11) admits a unique 
viscosity solution v : [0, T] x M” — )■ M, which has the following probabilistic representation: 


v{t, x) = F/’*, V(t, x) G [0, T] X M", 


( 12 ) 


where Z{-)) is the unique adapted solution of the FBSDE system (8)-(9). Furthermore, 


for all s G [t,T], and if (11) admits a classical solution, then (12) provides that classical solution. 


A comparison of equations (7) and (11) indicates that the nonlinear Feynman-Kac representa¬ 
tion can be applied to the HJB equation given by (7) under a certain decomposability condition, 
stated in the following assumption: 


Assumption 1 There exists a matrix-valued function F : [0,r] x MT —)> such that G{t,x) = 

T,{t,x)r{t,x) for all {t,x) G [0,T] x M”, satisfying the same mild regularity conditions. 

This assumption implies that the range of G must be a subset of the range of S, and thus excludes 
the case of a channel containing control input but no noise, although the converse is allowed. Under 
this assumption, and suppressing arguments for brevity, the HJB equation given by (7) becomes 


vt + ]^ti{vxxTX^) + v~lf + q- TYR ^F'^S'^Ua; = 0, (t, x) G [0, T) x 

v{T,x) = g{x), X G M", 


(14) 


which satisfies the format of (11) with 


b{t, x) 

= f(t,x) 

(15) 

h{t, x, z) 

= q(t,x) — -z~^r(t, x}R~^T~^(t, x}z. 

(16) 


We may thus obtain the (viscosity) solution of (14) by simulating the system of FBSDE given by 
(8) and (9). Notice that (8) corresponds to the uncontrolled {u = 0) system dynamics. 


4 Approximating the Solution of FBSDEs 

The solution of FBSDEs has been studied to a great extent independently from its connection 
to PDEs, mainly within the field of mathematical finance. Though several generic schemes exist 
[19-21], in this paper we propose a scheme which exploits the regularity present in FBSDEs that 
arise from the application of the nonlinear Feynman-Kac lemma. 

We begin by selecting a time grid {t = to < ... < t^ = T} ioi the interval [t,T], and denote 
by Ati = tj+i — ti the {i -|- l)-th interval of the grid (which can be selected to be constant) and 
AWi = — Wti the {i -\- l)-th Brownian motion increment^. For notational brevity, we also 

denote Xi = Xt-. The simplest discretized scheme for the forward process is the Euler scheme, 
which is also called Euler-Maruyama scheme [22]: 

j Aj_|_i ~ Aj -|- b{ti, Xi)Ati -|- S(tj, Xi)AWi, 

U = l,...,iV, Xo = x. ^ 


^Here, AWi would be simulated as YAU^i, where ^i ~ I). 



Several alternative, higher order schemes exist that can be selected in lieu of the Euler scheme [22]. 
To discretize the backward process, we further introduce the notation Yi = Yt- and Zi = Zt^. 
Then, recalling that adapted BSDE solutions impose Yg = E[y's|^s] and Zg = ¥.[Zg\^g] (i.e., a 
back-propagation of the conditional expectations), we approximate equation (9) by 

E = « E[T,+1 + hiU+uX,+i,Y+uZ+i)AU\X,]. (18) 

Notice that in the last equality the term Zj AlTj in (9) vanishes because of the conditional ex¬ 
pectation {AWi is zero mean), and we replace with Xi in light of the Markovian property 
presented in Section 3.1. By virtue of equation (13), the Z-process in (9) corresponds to the term 
S'''(s, Ai’^)ux(s, xi’*). Therefore we can write 

Zi = nZ^\^u] = E[sT(t„X,)V,u(ti,Xi)|Xi] 

= tJ {ti,Xi)Vg,v{ti,Xi), (19) 

which naturally requires knowledge of the solution at time ti on a neighborhood x, v{ti,x). The 
backpropagation is initialized at Y^ = giY^) and Z^ = E(T, XT)~^Y'xg{XT), for a g{-) which is dif¬ 
ferentiable almost everywhere. There are several ways to approximate the conditional expectation 
in (18), however in this work we shall employ the Least Squares Monte Carlo (LSMC) method^, 
which we shall briefly review in what follows. 

The LSMC method addresses the general problem of numerically estimating conditional ex¬ 
pectations of the form E[y|X] for square integrable random variables X and Y, if one is able to 
sample M independent copies of pairs (X, T). The method itself is based on the principle that 
the conditional expectation of a random variable can be modeled as a function of the variable 
on which it is conditioned on, that is, E[y|X] = (p*{X), where cj)* solves the infinite dimensional 
minimization problem 

())* = argminE[|(/>(X) - yp], (20) 

and (p ranges over all measurable functions with E[|(/)(X)p] < oo. A finite-dimensional approx¬ 
imation of this problem can be obtained if one decomposes ())(•) = with 

ip{-) being a row vector of predetermined basis functions and a a column vector of constants, 
thus solving a* = argmin^gjgi; E[|(/?(X)a — yp], with k being the dimension of the basis. Einally, 
this problem can be simplified to a linear least-squares problem if one substitutes the expectation 
operator with its empirical estimator [24], thus obtaining 

1 ^ 

a* = argmin — V| 9 ?(X^)a-y^|^ (21) 

aeM*’ M ^ 


wherein (X^,Y^), j = 1,..., M are independent copies of (X, y). Introducing the notation 

■^(xi) 


cl>(X) = 


G 


nMxk 


( 22 ) 


<^(X^) 

the solution to this least-squares problem can be obtained by directly solving the normal equation, 

yi 


i.e. 


a* = ( $^(X)$(X) ) $'(X) 


-1 


y 


M 


(23) 


'^Treating conditional expectations by means of linear regression was made popular in the field of mathematical 
finance by [23]. 









or by performing gradient descent. The LSMC estimator for the conditional expectation assumes 
then the form E[y|X = x] = « ^p{x)q*. 

Returning to our problem, we may apply the LSMC method to approximate the conditional 
expectation in equation (18) for each time step. To this end, we require a vector of basis functions (p 
for the approximation of Although the basis functions can be different at each time step, 

we shall use the same symbol for notational simplicity. Then, Monte Carlo simulation is performed 
by sampling M independent trajectories {A^™}i=i,,,.,7V; in which the index m = 1,..., M specifies 
a particular Monte Carlo trajectory. Whenever this index is not present, the entirety with respect 
to this index is to be understood. The numerical scheme is initialized at the terminal time T 
and is iterated backwards along the entire time grid, until the starting time instant has been 
reached. At each time step ti, we are given M pairs of data on which we perform 

linear regression to estimate the conditional expectation of 1) as a function of x at the time step 
ti- This provides us an approximation of the Value function v at time ti for the neighborhood 
of the state space that has been explored by the sample trajectories at that time instant, since 
v{ti,x) = E[Yi\Xi = x] Ri (p{x)ai. We then replace Yf^ = E[V^™|A™] ss thereby treating 

the conditional expectation as a projection operator. Finally, the approximation of the conditional 
expectation of Zi is obtained by taking the gradient with respect to x on v{ti,x), evaluating it at 
XY', and scaling it with S 

This process is repeated for ti-i,... ,ti- Note that this approach requires the basis functions </?(•) 
of our choice to be differentiable almost everywhere, so that XxPix) is available in analytical form 
for almost any x. The proposed algorithm is then summarized as 


Initialize : Yt = giXr), Zt = E{T, XrVVxgiXT), 

ai = argmin^ ^{Xi)a - ( V+i + W+i, V+i, Zj+i) 

am \ 

V = HX^ai, ZY = Y{ti, XYVXxip{XY)a,, 


where m = 1,... ,M and the matrix <I> defined in (22). Again, the minimizer can be obtained by 
directly solving the normal equation, i.e.. 


0.1 = 




‘I>^(V,) 


^V+i T Xtih{ti-\-i , Xij-i , L)+i) 



(25) 


or by performing gradient descent. The essential algorithm output is the collection of a^’s, that 
is, the basis function coefficients at each time instant, which are needed to recover the Value 
function approximation for the particular area of the state space that is explored by the forward 
process. This is in contrast with methods that calculate the solution over an entire pre-specified 
grid (and thus typically exhibit bad scalability), but also differs from local trajectory optimization 
methods which consider only infinitesimal variations around a nominal trajectory. Furthermore, 
an important difference between the proposed method and forward sampling based methods is that 
the latter provide a solution only for the point of the initial condition {t,x), while the solution of 
this method covers an area starting from the initial condition (t, x), expanding in state space until 
time T is reached. 

®Here, denotes the quantity + Atih{ti+i, XYi,YYi, ZYi): which is the V™ sample value before the 
conditional expectation operator has been applied. 








5 Learning Control: An Iterative Scheme based on Importance 
Sampling 

The proposed method, as it has been presented so far, suffers from a significant limitation. Namely, 
its ability to provide approximations to the value function is restricted to only those areas of the 
state space that are reachable by unforced dynamics (eq. (8)). Indeed, there are several cases of 
systems in which the goal state practically cannot be reached by the uncontrolled system dynamics 
(consider, for example, an inverted pendulum). Furthermore, even in the case in which the target 
state is indeed reached by unforced trajectories, as the dimensionality of the state space increases, 
the density of sample trajectories along any given path from the initial state to the target state 
reduces quickly, thus increasing the demand for available samples. These issues can be eliminated 
if one is given the ability to modify the drift term of the sampled trajectories. Specifically, by 
changing the drift, we can direct the exploration of the state space towards the goal state, or any 
other state of interest, reachable by control. As will be shortly demonstrated, such a scheme can 
indeed be constructed by means of a careful application of Girsanov’s theorem on the change of 
measure. Applying importance sampling on FBSDEs is not an entirely new concept, as it was hrst 
introduced as a variance reduction technique [25]. Through Girsanov’s theorem, one may alter the 
drift of the forward process if this modification is appropriately compensated for in the backward 
process. That is, the system of FBSDEs given by equations (8) and (9) is in some certain sense 
equivalent to one with modified drift 

dX, = [5(s, Xs) + E(s, Xs)Ks]ds + S(s, X,)dIT„ 

Xt = x. 

along with the compensated BSDE 

dYs = [-h{s, Xs,Ys, Zs) + ZjKs]ds + Zj dVF,, 

Yt = 9{Xt), 

for any measurable, bounded and adapted process K : [0, T] —)• M^. Equivalence is not path-wise 
of course, since the paths realized by both the forward and the backward processes will be different 
under the modihed drift dynamics. However, the solution at starting time t, that is {Yt,Zt), will 
remain unaffected. In other words, the estimate of the Value function at the initial condition 
(t, x) is independent of the drift term modification, as will be proven shortly. Indeed, following 
Girsanov’s Theorem [13,26], we define a new measure Q with dQ(a;) = M(T;t,oj)dF(uj), where 

Ms ^ exp Kj dHA - ^ iKrl^dr'^ , s e [t,T], 

is the process of Radon-Nikodym derivatives dQ^^^/dP*-®^ with and being the restrictions 
of Q and P to respectively. Then, Mg is a P-martingale, the P-law of {X, Y, Z) is the same as 
the Q-law of (X, Y, Z), and 

Ws = Krdr + Ws, s£[t,T], 

is a Brownian motion under Q. In fact, defining the Q-Brownian increment dWg = Kgdt + dWg, 
it becomes evident that equations (26) and (27) are simply copies of the dynamics of equations 
(8) and (9), if one substitutes dWg in the latter with dWg. Now, notice that since at the time of 
initialization, t, Mt is by construction equal to one with probability one (in both P and Q-measure), 
the measures P and Q restricted to are equal, and therefore the pairs {Yt,Zt) and {Yt,Zt) are 


■s £ [t, T], 


(26) 


s £ [i, T], 


('27'! 


equal in expectation as well. This proves that the Value function at the initial condition (t, x) is 
independent of the drift term modification. An additional intuitive, albeit informal, explanation of 
why the modified system of FBSDEs (26), (27) can be used in lieu of the original FBSDE system 
is readily obtained if one examines the associated PDEs. Indeed, the EBSDE problem defined by 
(26) and (27) corresponds to the PDE problem 

{ vt-\- ^tT{vxx^^^) + vj{b + T,K) + h{t, X, V, Vx) - uj YK = 0, {t, x) G [0, T) x 

\ v{T,x) = g{x), 

which of course is identical to the PDE problem (11), as we have merely added and subtracted the 
term v^YK. Thus, although the FBSDEs are different, they are associated with the same PDE 
problem. 

Returning to the original problem formulation and recalling the definition of F(-) in Assumption 
1, we may apply any nominal control u to the state dynamics in order to obtain the modified drift 
system, which exhibits the form 

dx(t) = [f(t, x{t)) + Y{t, x{t))T(t, x{t))u{t)]dt + Y{t, x{t))dWt. (28) 

Thus, the controlled system trajectories are samples from the forward process (26) with 

Ks = T{s,Xs)u{s), s€[t,T], (29) 

while b{s, X^) = f(s, Xg) as per (15). Notice that the nominal control u may be any open or closed- 
loop control, a random control, or even a control calculated by a previous run of the algorithm. 
In the latter case, one obtains a more refined solution, thus arriving at an iterative scheme. For 
the discrete representation on the time grid of Section 4, we define Ki = Kt^. The forward process 
can again be sampled using the Euler-Maruyama scheme. There are several equivalent ways to 
incorporate importance sampling in the backward process, however the most straightforward way 
is to simply define 

h{s, X, y, z, k) = h{s, x, y, z) — z'^k, (30) 

and utilize the discretized scheme presented in Section 4 using h instead of h. 

6 Simulation Results 

To evaluate the algorithm’s performance, we simulated the algorithm on an inverted pendulum 
and a cart-pole system. These simulations demonstrate that the nonlinearity in the dynamics is 
handled efficiently, and furthermore illustrate the signihcance of the iterative scheme which features 
importance sampling. 

6.1 The Inverted Pendulum 

The equations of motion for the inverted pendulum are given by 

mi‘^9 + bO — mgi sin 6 = u, (31) 

and stochasticity enters the system in form of perturbations in the torque u. For the purposes 
of this simulation, two thousand trajectories were generated on a time grid of 0.004 with time 
horizon T=2. The system noise covariance was set on 0.1. No initial guess for the control input 
was necessary, though a white noise signal has been injected in the control input during the 
sampling stages to increase variation in the trajectories, since the system noise intensity is low. 
For the basis of the Value function approximation, modified Chebyshev polynomials [27] up to 


second order have been selected. The scheme was repeated for 15 iterations, with the algorithm 
successfully learning the optimal control to invert and stabilize the pendulum. Figure 1 depicts 
the mean of the controlled trajectories for each algorithm iteration (gray scale). The trajectories 
after the final iteration are shown in red. Finally, Figure 2 depicts the convergence of the cost 
mean and standard deviation as the iterative scheme progresses. 



Figure 1: Inverted Pendulum: Trajectory mean for the position (left) and velocity (right) of the controlled 
system for each iteration (gray scale) and after the hnal iteration (red). The black dots represent the target 
states. 



Figure 2: Inverted Pendulum: Cost mean +- 3 standard deviations per iteration. 


6.2 The Cart-Pole system 

To assess the efficiency of the proposed scheme in underactuated systems, we simulated the algo¬ 
rithm on a cart-pole system (see Figure 3). The equations of motion are given by 

X = -y— (u — rup sin + g cos 9) ), (32) 

rric -b mp sin 6 \ ) 

9 = — -^—- ( ucos 9 — mp(.(f‘ cos 9 sin 9 + {rric + mp)gsm9 ], (33) 

i{mc + rUp sin 9) \ ) 

and stochasticity enters the system in form of perturbations in u. To this end, five thousand 
trajectories were generated on a time grid of 0.004 with time horizon T=3. The system noise 









covariance was set on 1. Again, no initial guess for the control input was necessary, and a white 
noise signal has been injected in the control input during the sampling stages to increase variation 
in the trajectories, since the system noise intensity is low. For the basis of the Value function 
approximation, modified Chebyshev polynomials up to second order have been selected. The 
scheme was repeated for 35 iterations. Figure 4 depicts the mean of the controlled trajectories for 
each algorithm iteration (gray scale). The trajectories after the final iteration are shown in red. 
Finally, Figure 5 depicts the convergence of the cost mean and standard deviation as the iterative 
scheme progresses. 


F 



Figure 3: Cart pole: rric denoted the mass of the cart, rup denotes the mass of the pole and I is the length 
of the pole. 





Figure 4: Cart-pole: Clockwise starting at the top left- cart position, cart velocity, pole velocity, pole 
position. Trajectory mean of the controlled system for each iteration (gray scale) and after the final iteration 
(red). The black dots represent the target states. 


7 Conclusions 

In this paper we proposed a new algorithm for nonlinear stochastic control problems with dynamics 
affine in control and cost functions that are non-quadratic in the state and quadratic in controls. In 
light of a nonlinear Feynman-Kac lemma which establishes a connection between certain PDFs and 
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Figure 5: Cart-pole: Cost mean -I— 3 standard deviations per iteration. 


SDEs, we were able to obtain a probabilistic representation of the solution to the nonlinear HJB 
PDF, expressed as a system of FBSDEs. This system is then simulated using linear regression. 
Finally, in order to enhance the algorithm’s efficiency in treating more complex nonlinear systems, 
we proposed an iterative scheme based on Girsanov’s theorem on the change of measure, which 
features importance sampling. We demonstrated the ability of the proposed iterative algorithm to 
learn the optimal controls without an initial guess in an inverted pendulum system and a cart-pole 
system. 
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